Exploratory Data Analysis¶
# Libraries for data manipulation
import pandas as pd
import numpy as np
# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Load the Coffee Dataset
df = pd.read_csv(r"C:\Users\nanav\Downloads\Coffee Dataset.csv")
Data Overview¶
Here, we'll first look at the Coffee dataset. We'll check the first few rows to understand what the data looks like.¶
df.head(10)
| Submission ID | What is your age? | How many cups of coffee do you typically drink per day? | Where do you typically drink coffee? | Where do you typically drink coffee? (At home) | Where do you typically drink coffee? (At the office) | Where do you typically drink coffee? (On the go) | Where do you typically drink coffee? (At a cafe) | Where do you typically drink coffee? (None of these) | How do you brew coffee at home? | ... | What is the most you'd ever be willing to pay for a cup of coffee? | Do you feel like you’re getting good value for your money when you buy coffee at a cafe? | Approximately how much have you spent on coffee equipment in the past 5 years? | Do you feel like you’re getting good value for your money with regards to your coffee equipment? | Gender | Education Level | Ethnicity/Race | Employment Status | Number of Children | Political Affiliation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gMR29l | 18-24 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | BkPN0e | 25-34 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Pod/capsule machine (e.g. Keurig/Nespresso) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | W5G8jj | 25-34 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bean-to-cup machine | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4xWgGr | 35-44 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Coffee brewing machine (e.g. Mr. Coffee) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | QD27Q8 | 25-34 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Pour over | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | V0LPeM | 55-64 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Pod/capsule machine (e.g. Keurig/Nespresso), E... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | V0Gaxg | 18-24 years old | NaN | At a cafe, At the office, At home, On the go | True | True | True | True | False | Pour over, French press, Espresso, Instant cof... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | AdzRL0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | LbWda2 | 25-34 years old | Less than 1 | At a cafe | False | False | False | True | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | EXQLWN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 111 columns
Basic Dataset information¶
# Shape of the dataframe
df.shape
(4042, 111)
# Name of each column in dataframe
df.columns.tolist()
['Submission ID', 'What is your age?', 'How many cups of coffee do you typically drink per day?', 'Where do you typically drink coffee?', 'Where do you typically drink coffee? (At home)', 'Where do you typically drink coffee? (At the office)', 'Where do you typically drink coffee? (On the go)', 'Where do you typically drink coffee? (At a cafe)', 'Where do you typically drink coffee? (None of these)', 'How do you brew coffee at home?', 'How do you brew coffee at home? (Pour over)', 'How do you brew coffee at home? (French press)', 'How do you brew coffee at home? (Espresso)', 'How do you brew coffee at home? (Coffee brewing machine (e.g. Mr. Coffee))', 'How do you brew coffee at home? (Pod/capsule machine (e.g. Keurig/Nespresso))', 'How do you brew coffee at home? (Instant coffee)', 'How do you brew coffee at home? (Bean-to-cup machine)', 'How do you brew coffee at home? (Cold brew)', 'How do you brew coffee at home? (Coffee extract (e.g. Cometeer))', 'How do you brew coffee at home? (Other)', 'How else do you brew coffee at home?', 'On the go, where do you typically purchase coffee?', 'On the go, where do you typically purchase coffee? (National chain (e.g. Starbucks, Dunkin))', 'On the go, where do you typically purchase coffee? (Local cafe)', 'On the go, where do you typically purchase coffee? (Drive-thru)', 'On the go, where do you typically purchase coffee? (Specialty coffee shop)', 'On the go, where do you typically purchase coffee? (Deli or supermarket)', 'On the go, where do you typically purchase coffee? (Other)', 'Where else do you purchase coffee?', 'What is your favorite coffee drink?', 'Please specify what your favorite coffee drink is', 'Do you usually add anything to your coffee?', 'Do you usually add anything to your coffee? (No - just black)', 'Do you usually add anything to your coffee? (Milk, dairy alternative, or coffee creamer)', 'Do you usually add anything to your coffee? (Sugar or sweetener)', 'Do you usually add anything to your coffee? 
(Flavor syrup)', 'Do you usually add anything to your coffee? (Other)', 'What else do you add to your coffee?', 'What kind of dairy do you add?', 'What kind of dairy do you add? (Whole milk)', 'What kind of dairy do you add? (Skim milk)', 'What kind of dairy do you add? (Half and half)', 'What kind of dairy do you add? (Coffee creamer)', 'What kind of dairy do you add? (Flavored coffee creamer)', 'What kind of dairy do you add? (Oat milk)', 'What kind of dairy do you add? (Almond milk)', 'What kind of dairy do you add? (Soy milk)', 'What kind of dairy do you add? (Other)', 'What kind of sugar or sweetener do you add?', 'What kind of sugar or sweetener do you add? (Granulated Sugar)', 'What kind of sugar or sweetener do you add? (Artificial Sweeteners (e.g., Splenda))', 'What kind of sugar or sweetener do you add? (Honey)', 'What kind of sugar or sweetener do you add? (Maple Syrup)', 'What kind of sugar or sweetener do you add? (Stevia)', 'What kind of sugar or sweetener do you add? (Agave Nectar)', 'What kind of sugar or sweetener do you add? (Brown Sugar)', 'What kind of sugar or sweetener do you add? (Raw Sugar (Turbinado))', 'What kind of flavorings do you add?', 'What kind of flavorings do you add? (Vanilla Syrup)', 'What kind of flavorings do you add? (Caramel Syrup)', 'What kind of flavorings do you add? (Hazelnut Syrup)', 'What kind of flavorings do you add? (Cinnamon (Ground or Stick))', 'What kind of flavorings do you add? (Peppermint Syrup)', 'What kind of flavorings do you add? 
(Other)', 'What other flavoring do you use?', "Before today's tasting, which of the following best described what kind of coffee you like?", 'How strong do you like your coffee?', 'What roast level of coffee do you prefer?', 'How much caffeine do you like in your coffee?', 'Lastly, how would you rate your own coffee expertise?', 'Coffee A - Bitterness', 'Coffee A - Acidity', 'Coffee A - Personal Preference', 'Coffee A - Notes', 'Coffee B - Bitterness', 'Coffee B - Acidity', 'Coffee B - Personal Preference', 'Coffee B - Notes', 'Coffee C - Bitterness', 'Coffee C - Acidity', 'Coffee C - Personal Preference', 'Coffee C - Notes', 'Coffee D - Bitterness', 'Coffee D - Acidity', 'Coffee D - Personal Preference', 'Coffee D - Notes', 'Between Coffee A, Coffee B, and Coffee C which did you prefer?', 'Between Coffee A and Coffee D, which did you prefer?', 'Lastly, what was your favorite overall coffee?', 'Do you work from home or in person?', 'In total, much money do you typically spend on coffee in a month?', 'Why do you drink coffee?', 'Why do you drink coffee? (It tastes good)', 'Why do you drink coffee? (I need the caffeine)', 'Why do you drink coffee? (I need the ritual)', 'Why do you drink coffee? (It makes me go to the bathroom)', 'Why do you drink coffee? (Other)', 'Other reason for drinking coffee', 'Do you like the taste of coffee?', 'Do you know where your coffee comes from?', "What is the most you've ever paid for a cup of coffee?", "What is the most you'd ever be willing to pay for a cup of coffee?", 'Do you feel like you’re getting good value for your money when you buy coffee at a cafe?', 'Approximately how much have you spent on coffee equipment in the past 5 years?', 'Do you feel like you’re getting good value for your money with regards to your coffee equipment?', 'Gender', 'Education Level', 'Ethnicity/Race', 'Employment Status', 'Number of Children', 'Political Affiliation']
# Datatype of each column in dataframe
df.dtypes
Submission ID object
What is your age? object
How many cups of coffee do you typically drink per day? object
Where do you typically drink coffee? object
Where do you typically drink coffee? (At home) object
...
Education Level object
Ethnicity/Race object
Employment Status object
Number of Children object
Political Affiliation object
Length: 111, dtype: object
df.isnull().any()
Submission ID False
What is your age? True
How many cups of coffee do you typically drink per day? True
Where do you typically drink coffee? True
Where do you typically drink coffee? (At home) True
...
Education Level True
Ethnicity/Race True
Employment Status True
Number of Children True
Political Affiliation True
Length: 111, dtype: bool
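`isnull().any()` only says whether a column has at least one missing value. A minimal sketch of how the share of missing values per column could be inspected instead, using a made-up toy frame (the `toy` data and the `missing_share` name are illustrative assumptions, not part of the survey data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "age": ["18-24", np.nan, "25-34", np.nan],
    "notes": [np.nan, np.nan, np.nan, "fruity"],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the result puts the sparsest columns first, which helps when choosing a drop threshold in the cleaning step.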
df.describe()
| What kind of flavorings do you add? | What kind of flavorings do you add? (Vanilla Syrup) | What kind of flavorings do you add? (Caramel Syrup) | What kind of flavorings do you add? (Hazelnut Syrup) | What kind of flavorings do you add? (Cinnamon (Ground or Stick)) | What kind of flavorings do you add? (Peppermint Syrup) | What kind of flavorings do you add? (Other) | What other flavoring do you use? | Lastly, how would you rate your own coffee expertise? | Coffee A - Bitterness | ... | Coffee A - Personal Preference | Coffee B - Bitterness | Coffee B - Acidity | Coffee B - Personal Preference | Coffee C - Bitterness | Coffee C - Acidity | Coffee C - Personal Preference | Coffee D - Bitterness | Coffee D - Acidity | Coffee D - Personal Preference | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3938.000000 | 3798.000000 | ... | 3789.000000 | 3780.000000 | 3767.000000 | 3773.000000 | 3764.000000 | 3751.000000 | 3766.000000 | 3767.000000 | 3765.000000 | 3764.000000 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.693499 | 2.141127 | ... | 3.310900 | 3.013228 | 2.223786 | 3.068646 | 3.071998 | 2.366836 | 3.064790 | 2.162729 | 3.858167 | 3.375930 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.948867 | 0.947163 | ... | 1.185953 | 0.992875 | 0.865389 | 1.113546 | 0.999267 | 0.921048 | 1.128431 | 1.081546 | 1.007973 | 1.452504 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.000000 | 1.000000 | ... | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 1.000000 | 3.000000 | 2.000000 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.000000 | 2.000000 | ... | 3.000000 | 3.000000 | 2.000000 | 3.000000 | 3.000000 | 2.000000 | 3.000000 | 2.000000 | 4.000000 | 4.000000 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.000000 | 3.000000 | ... | 4.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 3.000000 | 4.000000 | 3.000000 | 5.000000 | 5.000000 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10.000000 | 5.000000 | ... | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
8 rows × 21 columns
Data Cleaning¶
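# Preview the columns that are at least 70% populated. thresh is the minimum number of
# non-NaN values a column needs to be kept, so this drops columns with more than 30% NaN.
# The result is only displayed, not assigned, so df keeps all 111 columns afterwards.
threshold = len(df) * 0.7
df.dropna(thresh=threshold, axis=1)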
| Submission ID | What is your age? | How many cups of coffee do you typically drink per day? | Where do you typically drink coffee? | Where do you typically drink coffee? (At home) | Where do you typically drink coffee? (At the office) | Where do you typically drink coffee? (On the go) | Where do you typically drink coffee? (At a cafe) | Where do you typically drink coffee? (None of these) | How do you brew coffee at home? | ... | What is the most you've ever paid for a cup of coffee? | What is the most you'd ever be willing to pay for a cup of coffee? | Do you feel like you’re getting good value for your money when you buy coffee at a cafe? | Approximately how much have you spent on coffee equipment in the past 5 years? | Do you feel like you’re getting good value for your money with regards to your coffee equipment? | Gender | Education Level | Ethnicity/Race | Employment Status | Political Affiliation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gMR29l | 18-24 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | BkPN0e | 25-34 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Pod/capsule machine (e.g. Keurig/Nespresso) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | W5G8jj | 25-34 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bean-to-cup machine | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4xWgGr | 35-44 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Coffee brewing machine (e.g. Mr. Coffee) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | QD27Q8 | 25-34 years old | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Pour over | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4037 | PA44VP | >65 years old | 2 | At home | True | False | False | False | False | Coffee brewing machine (e.g. Mr. Coffee) | ... | $6-$8 | $4-$6 | No | Less than $20 | Yes | Female | Master's degree | White/Caucasian | Retired | Democrat |
| 4038 | vNgpPD | >65 years old | 2 | At home | True | False | False | False | False | Coffee brewing machine (e.g. Mr. Coffee) | ... | $4-$6 | $2-$4 | No | Less than $20 | Yes | Male | Bachelor's degree | White/Caucasian | Retired | Republican |
| 4039 | g5ggRM | 18-24 years old | 1 | At a cafe, At home, On the go, At the office | True | True | True | True | False | Espresso, Pod/capsule machine (e.g. Keurig/Nes... | ... | $8-$10 | More than $20 | Yes | $300-$500 | Yes | Male | Some college or associate's degree | White/Caucasian | Employed full-time | Democrat |
| 4040 | rlgbDN | 25-34 years old | 2 | At home | True | False | False | False | False | Pour over | ... | $4-$6 | $8-$10 | Yes | $100-$300 | Yes | Male | Bachelor's degree | White/Caucasian | Unemployed | Democrat |
| 4041 | 0EGYe9 | 25-34 years old | 1 | At home | True | False | False | False | False | Pour over, French press, Espresso, Other | ... | $15-$20 | $15-$20 | Yes | $500-$1000 | Yes | Female | Doctorate or professional degree | White/Caucasian | Employed full-time | Democrat |
4042 rows × 67 columns
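Because `thresh` counts non-NaN values, the threshold has to be inverted when the goal is phrased as a maximum share of missing values. The sketch below illustrates the difference on a made-up 10-row frame; the column names and the `strict` / `lenient` variables are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows and three columns of different sparsity (made-up values)
toy = pd.DataFrame({
    "full": list(range(10)),                    # 0% missing
    "half": [1, 2, 3, 4, 5] + [np.nan] * 5,     # 50% missing
    "sparse": [1, 2] + [np.nan] * 8,            # 80% missing
})

# Keep columns with at least 7 non-NaN values (7 = 70% of 10 rows)
strict = toy.dropna(thresh=7, axis=1)

# Drop only columns with MORE than 70% NaN: at most 7 of 10 values may be
# missing, so a column needs at least 10 - 7 = 3 non-NaN values to be kept
max_missing_count = len(toy) * 70 // 100
lenient = toy.dropna(thresh=len(toy) - max_missing_count, axis=1)

print(list(strict.columns))
print(list(lenient.columns))
```

Integer arithmetic is used for the lenient threshold to avoid floating-point rounding surprises when computing the cutoff.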
# Step 1: Drop rows with too many NaN values (keep rows that are at least 50% populated)
df = df.dropna(thresh=df.shape[1] * 0.5, axis=0)
# Step 2: Preview the data with duplicate rows removed (the result is displayed, not assigned back to df)
df.drop_duplicates()
| Submission ID | What is your age? | How many cups of coffee do you typically drink per day? | Where do you typically drink coffee? | Where do you typically drink coffee? (At home) | Where do you typically drink coffee? (At the office) | Where do you typically drink coffee? (On the go) | Where do you typically drink coffee? (At a cafe) | Where do you typically drink coffee? (None of these) | How do you brew coffee at home? | ... | What is the most you'd ever be willing to pay for a cup of coffee? | Do you feel like you’re getting good value for your money when you buy coffee at a cafe? | Approximately how much have you spent on coffee equipment in the past 5 years? | Do you feel like you’re getting good value for your money with regards to your coffee equipment? | Gender | Education Level | Ethnicity/Race | Employment Status | Number of Children | Political Affiliation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Zd694B | <18 years old | 3 | At home, At the office, At a cafe | True | True | False | True | False | Pour over, Espresso, Instant coffee | ... | NaN | NaN | NaN | NaN | Other | Bachelor's degree | Other | Employed full-time | More than 3 | Democrat |
| 17 | QA5JYA | 25-34 years old | 1 | At home, At the office, On the go | True | True | True | False | False | Pour over, Coffee brewing machine (e.g. Mr. Co... | ... | NaN | NaN | NaN | NaN | Female | Bachelor's degree | White/Caucasian | Employed full-time | NaN | Democrat |
| 34 | ylqbBg | 45-54 years old | 2 | At home, At the office, At a cafe, On the go | True | True | True | True | False | Pour over, French press, Espresso | ... | $8-$10 | No | $500-$1000 | Yes | Male | Master's degree | Other | Employed full-time | 2 | No affiliation |
| 39 | BGboZR | 18-24 years old | 3 | At home, At a cafe, On the go | True | False | True | True | False | Pour over, French press, Espresso, Other, Cold... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 41 | YZzBdN | 25-34 years old | 2 | At home, At the office | True | True | False | False | False | Pour over, Espresso | ... | More than $20 | Yes | $50-$100 | Yes | Male | Master's degree | Asian/Pacific Islander | Unemployed | NaN | Independent |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4037 | PA44VP | >65 years old | 2 | At home | True | False | False | False | False | Coffee brewing machine (e.g. Mr. Coffee) | ... | $4-$6 | No | Less than $20 | Yes | Female | Master's degree | White/Caucasian | Retired | 2 | Democrat |
| 4038 | vNgpPD | >65 years old | 2 | At home | True | False | False | False | False | Coffee brewing machine (e.g. Mr. Coffee) | ... | $2-$4 | No | Less than $20 | Yes | Male | Bachelor's degree | White/Caucasian | Retired | 2 | Republican |
| 4039 | g5ggRM | 18-24 years old | 1 | At a cafe, At home, On the go, At the office | True | True | True | True | False | Espresso, Pod/capsule machine (e.g. Keurig/Nes... | ... | More than $20 | Yes | $300-$500 | Yes | Male | Some college or associate's degree | White/Caucasian | Employed full-time | NaN | Democrat |
| 4040 | rlgbDN | 25-34 years old | 2 | At home | True | False | False | False | False | Pour over | ... | $8-$10 | Yes | $100-$300 | Yes | Male | Bachelor's degree | White/Caucasian | Unemployed | NaN | Democrat |
| 4041 | 0EGYe9 | 25-34 years old | 1 | At home | True | False | False | False | False | Pour over, French press, Espresso, Other | ... | $15-$20 | Yes | $500-$1000 | Yes | Female | Doctorate or professional degree | White/Caucasian | Employed full-time | 1 | Democrat |
3650 rows × 111 columns
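`dropna` and `drop_duplicates` return a new frame and leave the original untouched unless the result is assigned. A small sketch on made-up data (the `toy` frame is an assumption) shows the difference:

```python
import pandas as pd

# Toy frame with one exact duplicate row (made-up values)
toy = pd.DataFrame({"id": ["a", "a", "b"], "cups": [2, 2, 1]})

toy.drop_duplicates()          # returns a new frame; toy itself is unchanged
rows_before = len(toy)

# Assign the result back to keep the de-duplicated data
toy = toy.drop_duplicates().reset_index(drop=True)
rows_after = len(toy)

print(rows_before, rows_after)
```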
Renaming Column Headers¶
# Rename the home brewing method sub-columns to short names
df.rename(columns ={
'How do you brew coffee at home? (Pour over)':'PourOver',
'How do you brew coffee at home? (French press)':'FrenchPress',
'How do you brew coffee at home? (Espresso)':'Espresso',
'How do you brew coffee at home? (Coffee brewing machine (e.g. Mr. Coffee))':'CoffeeBrewingMachine',
'How do you brew coffee at home? (Pod/capsule machine (e.g. Keurig/Nespresso))':'CapsuleMachine',
'How do you brew coffee at home? (Instant coffee)':'InstantCoffee',
'How do you brew coffee at home? (Bean-to-cup machine)':'BeanToCupMachine',
'How do you brew coffee at home? (Cold brew)':'ColdBrew',
'How do you brew coffee at home? (Coffee extract (e.g. Cometeer))':'CoffeeExtract',
'How do you brew coffee at home? (Other)':'OtherMachine'},inplace =True)
# Rename the on-the-go purchase location sub-columns to short names
df.rename(columns ={'On the go, where do you typically purchase coffee? (National chain (e.g. Starbucks, Dunkin))':'NationalCahin',
'On the go, where do you typically purchase coffee? (Local cafe)':'LocalCafe',
'On the go, where do you typically purchase coffee? (Drive-thru)':'DeiveThru',
'On the go, where do you typically purchase coffee? (Specialty coffee shop)':'SpecialtyCoffeeShop',
'On the go, where do you typically purchase coffee? (Deli or supermarket)':'SuperMarket',
'On the go, where do you typically purchase coffee? (Other)':'OtherLocation'},inplace =True)
# Rename the drinking location sub-columns to short names
df.rename(columns ={
'Where do you typically drink coffee? (At home)':'AtHome',
'Where do you typically drink coffee? (At the office)':'AtOffice',
'Where do you typically drink coffee? (On the go)':'OnTheGo',
'Where do you typically drink coffee? (At a cafe)':'AtCafe',
'Where do you typically drink coffee? (None of these)':'NoneOfThese'},inplace =True)
# Rename the sugar and sweetener sub-columns to short names
df.rename(columns ={
'What kind of sugar or sweetener do you add? (Artificial Sweeteners (e.g., Splenda))':'ArtificialSweeteners',
'What kind of sugar or sweetener do you add? (Granulated Sugar)':'GranulatedSugar',
'What kind of sugar or sweetener do you add? (Honey)':'Honey',
'What kind of sugar or sweetener do you add? (Maple Syrup)':'MapleSyrup',
'What kind of sugar or sweetener do you add? (Stevia)':'Stevia',
'What kind of sugar or sweetener do you add? (Agave Nectar)':'AgaveNectar',
'What kind of sugar or sweetener do you add? (Brown Sugar)':'BrownSugar',
'What kind of sugar or sweetener do you add? (Raw Sugar (Turbinado))':'RawSugar'},inplace =True)
# Rename the dairy sub-columns to short names
df.rename(columns ={'What kind of dairy do you add? (Whole milk)':'WholeMilk',
'What kind of dairy do you add? (Skim milk)':'SkimMilk',
'What kind of dairy do you add? (Half and half)':'HalfAndHalf',
'What kind of dairy do you add? (Coffee creamer)':'CoffeeCreamer',
'What kind of dairy do you add? (Flavored coffee creamer)':'FlavoredCoffeeCreamer',
'What kind of dairy do you add? (Oat milk)':'OatMilk',
'What kind of dairy do you add? (Almond milk)':'AlmondMilk',
'What kind of dairy do you add? (Soy milk)':'SoyMilk',
'What kind of dairy do you add? (Other)':'OtherMilk'},inplace =True)
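The rename dictionaries above are typed by hand, which is where spelling slips creep in. As a rough alternative, the short name could be derived from the option in parentheses. `short_name` below is a hypothetical helper, not something the notebook uses, and its output will not always match the hand-picked names (for example it yields `AtTheOffice` rather than `AtOffice`):

```python
import re

def short_name(column):
    """Hypothetical helper: turn 'Question? (Some option (e.g. X))' into 'SomeOption'."""
    match = re.search(r"\?\s*\((.*)\)\s*$", column)
    if match is None:
        return column  # not a sub-column, leave it untouched
    # Drop nested hints such as '(e.g. Mr. Coffee)' from the option text
    option = re.sub(r"\s*\(.*\)", "", match.group(1))
    # Split on anything that is not a letter or digit and capitalize each word
    words = re.split(r"[^A-Za-z0-9]+", option)
    return "".join(w[:1].upper() + w[1:] for w in words if w)

examples = [
    "How do you brew coffee at home? (Pour over)",
    "How do you brew coffee at home? (Coffee brewing machine (e.g. Mr. Coffee))",
    "How do you brew coffee at home? (Bean-to-cup machine)",
    "Gender",
]
print([short_name(c) for c in examples])
```

One caveat with this approach: every question has an `(Other)` option, so all of them would collapse to the same name `Other`. The hand-written dictionaries avoid that collision with names like `OtherMachine` and `OtherLocation`, so an automatic scheme would need a per-question prefix.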
Merging Sub-Columns¶
brewing_columns = [
'PourOver',
'FrenchPress',
'Espresso',
'CoffeeBrewingMachine',
'CapsuleMachine',
'InstantCoffee',
'BeanToCupMachine',
'ColdBrew',
'CoffeeExtract',
'OtherMachine',
]
# Convert all brewing method columns to boolean (True for non-zero/non-empty, False for NaN or 0)
df[brewing_columns] = df[brewing_columns].notna() & df[brewing_columns].astype(bool)
# Create the main column with the names of the brewing methods used
df['How do you brew coffee at home?'] = df[brewing_columns].apply(
lambda row: ', '.join([col for col, val in row.items() if val]), axis=1
)
location =[
'AtHome',
'AtOffice',
'OnTheGo',
'AtCafe',
'NoneOfThese']
# Convert all drinking location columns to boolean (True for non-zero/non-empty, False for NaN or 0)
df[location] = df[location].notna() & df[location].astype(bool)
# Create the main column with the names of the drinking locations selected
df['Where do you typically drink coffee?'] = df[location].apply(
lambda row: ', '.join([col for col, val in row.items() if val]), axis=1
)
buylocation =[
'NationalCahin',
'LocalCafe',
'DeiveThru',
'SpecialtyCoffeeShop',
'SuperMarket',
'OtherLocation']
# Convert all purchase location columns to boolean (True for non-zero/non-empty, False for NaN or 0)
df[buylocation] = df[buylocation].notna() & df[buylocation].astype(bool)
# Create the main column with the names of the purchase locations selected
df['On the go, where do you typically purchase coffee?'] = df[buylocation].apply(
lambda row: ', '.join([col for col, val in row.items() if val]), axis=1)
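The same convert-then-join pattern is repeated for three groups of columns. A minimal sketch of how it could be wrapped in one function, assuming a hypothetical helper name `merge_flags` and a made-up toy frame:

```python
import numpy as np
import pandas as pd

def merge_flags(frame, flag_columns):
    """Hypothetical helper: join the names of the truthy flag columns in each row."""
    # Same rule as in the cells above: NaN counts as False, other values by truthiness
    flags = frame[flag_columns].notna() & frame[flag_columns].astype(bool)
    return flags.apply(
        lambda row: ", ".join(col for col, val in row.items() if val), axis=1
    )

# Toy frame with two flag columns (made-up values)
toy = pd.DataFrame({
    "PourOver": [True, np.nan, False],
    "Espresso": [True, True, np.nan],
})
toy["How do you brew coffee at home?"] = merge_flags(toy, ["PourOver", "Espresso"])
print(toy["How do you brew coffee at home?"].tolist())
```

Note that a row with no option selected ends up with an empty string rather than NaN, so later filters on these merged columns need to test for `''` instead of relying on `notna()`.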
# Shorten the age labels by removing the 'years old' suffix (e.g. '25-34 years old' -> '25-34')
df['What is your age?'] = df['What is your age?'].str.replace('years old','').str.strip()
Visualization and Data Insights¶
# Make sure the 'years old' suffix is removed from the age labels (a no-op if it was already stripped).
# The earlier lines used '==' instead of '=', so they only compared values and changed nothing.
df['What is your age?'] = df['What is your age?'].str.replace('years old','').str.strip()
colors = ['goldenrod', 'lightblue', 'thistle', 'olivedrab', 'coral', 'mediumseagreen', 'slateblue']
# Count the occurrences using Seaborn's countplot
plt.figure(figsize=(8, 4))
ax = sns.countplot(data=df, x='What is your age?',hue='What is your age?',palette = colors)
# Adding bar labels on top of the bars
for bars in ax.containers:
    ax.bar_label(bars)
plt.title("Age Group Distribution")
plt.xlabel("Age Group")
plt.ylabel("Frequency")
plt.show()
# Create a horizontal bar chart
plt.figure(figsize=(10, 5))
ax = sns.countplot(y='Gender', data=df, hue='Gender', palette='viridis')
# Add bar labels
for bars in ax.containers:
    ax.bar_label(bars)
# Display the plot
plt.show()
gender_counts = df[df['Gender'].isin(['Male', 'Female'])]['Gender'].value_counts()
# Plotting the pie chart
plt.figure(figsize=(4, 4))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140, colors=["SkyBlue", "Coral"])
plt.title("Gender Distribution in Dataset")
plt.show()
# Total of AveragePrice by gender (requires the AveragePrice column created in the parse_price cell further down)
df.groupby('Gender')['AveragePrice'].sum()
Gender
Female 7180.5
Male 23529.0
Non-binary 1047.0
Other 85.0
Prefer not to say 281.5
Name: AveragePrice, dtype: float64
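These figures are sums, so they grow with the number of respondents in each group. To separate "more respondents" from "higher spend per respondent", count, sum, and mean can be compared side by side. The sketch below uses made-up numbers, not the survey values:

```python
import pandas as pd

# Made-up data: three male respondents and one female respondent
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female"],
    "AveragePrice": [5.0, 7.0, 9.0, 9.0],
})

# count = group size, sum = total, mean = per-respondent average
summary = toy.groupby("Gender")["AveragePrice"].agg(["count", "sum", "mean"])
print(summary)
```

In this toy case the larger group has the higher total but the lower mean, which is why a sum alone does not show who spends more per person.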
# Define colors for the plot
colors = ["SkyBlue", "Coral", "Goldenrod", "SeaGreen", "SlateGray"]
# Filter to include only rows where "Where do you typically drink coffee?" is not NaN
coffee_consumers_df = df[df['Where do you typically drink coffee?'].notna()]
# Plot count of coffee consumers by Employment Status and Gender
plt.figure(figsize=(14, 4))
ax = sns.countplot(data=coffee_consumers_df, x='Employment Status', hue='Gender', palette=colors)
# Adding labels on top of each bar
for bars in ax.containers:
    ax.bar_label(bars)
# Display the plot
plt.title("Employment Status of Coffee Consumers by Gender")
plt.xlabel("Employment Status")
plt.ylabel("Count of Coffee Consumers")
plt.show()
# Convert a price label to a number: strip the '$' signs and, for a range such as "$4-$6", take the midpoint of the lower and upper bounds
def parse_price(value):
    if pd.isnull(value):
        return None
    elif '-' in value:  # If the value is a range like "$4-$6"
        low, high = value.split('-')
        return (float(low.replace('$', '').strip()) + float(high.replace('$', '').strip())) / 2
    elif "More than" in value:  # If the value is "More than $20"
        return 20.0  # or choose a higher value like 25
    else:
        try:
            return float(value.replace('$', '').strip())  # Handle single values without ranges
        except ValueError:
            return None  # Default to None if there's an unexpected format
df["AveragePrice"] = df["What is the most you've ever paid for a cup of coffee?"].apply(parse_price)
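With `parse_price` as written, any label that is not a range, a bare number, or a "More than" label falls through to `None`. A possible variant that averages whatever dollar amounts appear in the label is sketched below. `price_midpoint` and the sample labels are illustrative assumptions; in particular, whether the survey contains a "Less than" label for this question is not confirmed here:

```python
import re
import pandas as pd

def price_midpoint(value):
    """Hypothetical variant of parse_price: average all amounts found in the label."""
    if pd.isnull(value):
        return None
    amounts = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", str(value))]
    if not amounts:
        return None  # no number in the label
    return sum(amounts) / len(amounts)

# Illustrative labels only
labels = pd.Series(["$4-$6", "More than $20", "Less than $2", None])
parsed = labels.apply(price_midpoint)
print(parsed.tolist())
```

Open-ended labels such as "More than $20" are mapped to their boundary value, which understates the true amount; that is the same simplification the original function makes.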
# Create a grid of histogram subplots, one per demographic category plus AveragePrice
fig = make_subplots(
rows=4, cols=2,
subplot_titles=("Gender", "What is your age?", "Education Level", "Ethnicity/Race",
"Employment Status", "Number of Children", "Political Affiliation", "AveragePrice")
)
# Add histograms for each subplot
fig.add_trace(go.Histogram(x=df['Gender']), row=1, col=1)
fig.add_trace(go.Histogram(x=df['What is your age?']), row=1, col=2)
fig.add_trace(go.Histogram(x=df['Education Level']), row=2, col=1)
fig.add_trace(go.Histogram(x=df['Ethnicity/Race']), row=2, col=2)
fig.add_trace(go.Histogram(x=df['Employment Status']), row=3, col=1)
fig.add_trace(go.Histogram(x=df['Number of Children']), row=3, col=2)
fig.add_trace(go.Histogram(x=df['Political Affiliation']), row=4, col=1)
fig.add_trace(go.Histogram(x=df['AveragePrice']), row=4, col=2)
# Update layout if needed
fig.update_layout(height=1200, width=1000, title_text="Count Plots")
fig.update_layout(showlegend=False) # Hide the legend if not needed
# Show the figure
fig.show()
#Split the "On the go, where do you typically purchase coffee?" column and explode it
df_expanded = df.assign(
Purchase_Location=df['On the go, where do you typically purchase coffee?'].str.split(', ')
).explode('Purchase_Location')
#Replace values in the Gender column
df_expanded['Gender'] = df_expanded['Gender'].str.replace('Other', 'Unknown')
#Remove empty entries after exploding
df_expanded = df_expanded[df_expanded['Purchase_Location'] != ""]
#Count occurrences of each location by gender
location_gender_counts = df_expanded.groupby(['Purchase_Location', 'Gender']).size().unstack(fill_value=0)
#Plotting the stacked bar chart
location_gender_counts.plot(kind="bar", stacked=True, figsize=(10, 5), color=['ivory', 'tan', 'olivedrab', 'lightblue'])
plt.title("On-the-Go Coffee Purchase Locations by Gender")
plt.xlabel("Purchase Location")
plt.ylabel("Count of Responses")
plt.legend(title="Gender", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
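The split, explode, and count steps behind this chart can be hard to picture on the full survey. A tiny sketch on made-up rows (the `toy` frame and its values are assumptions) shows what each step produces:

```python
import pandas as pd

# Made-up rows: one respondent with two locations, one with one, one with none
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Purchase": ["LocalCafe, DriveThru", "LocalCafe", ""],
})

# Split the comma-separated string into a list, then emit one row per list item
expanded = toy.assign(
    Purchase_Location=toy["Purchase"].str.split(", ")
).explode("Purchase_Location")

# An empty string splits into [''], so drop those rows after exploding
expanded = expanded[expanded["Purchase_Location"] != ""]

# Rows = location, columns = gender, cells = number of responses
counts = expanded.groupby(["Purchase_Location", "Gender"]).size().unstack(fill_value=0)
print(counts)
```

Because one respondent can contribute several rows after exploding, the bar heights count responses rather than people.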
# Plotting a pie chart to show the proportion of spending categories
# Grouping the data into spending categories for visualization
spending_categories = df['Approximately how much have you spent on coffee equipment in the past 5 years?'].value_counts(dropna=False)
# Plotting the pie chart
plt.figure(figsize=(6, 6))
plt.pie(spending_categories, labels=spending_categories.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired.colors)
plt.title("Proportion of Spending on Coffee Equipment Over 5 Years")
plt.show()
# Define the columns for which you want to plot value counts
columns = ['How do you brew coffee at home?', 'What kind of dairy do you add?', 'What kind of sugar or sweetener do you add?', 'What is your favorite coffee drink?']
# Set up a 2x2 grid for subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10)) # 2 rows, 2 columns
axes = axes.flatten() # Flatten to easily access each subplot
# Loop through each column, split, explode, and create bar charts
for i, column in enumerate(columns):
    # Split the comma-separated values and explode, then drop null or empty values
    df_expanded = df[column].str.split(', ').explode().dropna()
    df_expanded = df_expanded[df_expanded != '']  # Remove empty strings
    # Get the top 3 most common values
    top_3_values = df_expanded.value_counts().nlargest(3)
    # Plot the bar chart for the top 3 values
    ax = axes[i]
    ax.bar(top_3_values.index, top_3_values.values, color=['indigo', 'khaki', 'lavender'])  # Create vertical bar chart
    ax.set_title(f'Top 3 Most Frequent {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # Rotate x-axis labels for readability
# Adjust layout and spacing
plt.tight_layout(pad=3.0) # Add padding to reduce overlap between subplots
plt.show()
Final Insights¶
After analyzing the data, we have gathered key insights about customer coffee spending patterns based on age, gender, education level, and ethnicity/race.¶
Actionable Insights¶
• For the age feature, the 25-34 age group is the largest, with ~1844 customers (35-44: ~882, 18-24: ~398, 55-64: ~163), and this group tends to spend the most.
• For the gender feature, ~75% of purchases are made by male customers and the remaining ~25% by female customers, so male consumers are the major contributors to coffee sales. Male customers also spend more in total than female customers, as the summed purchase values show:
•Total amount spent by male customers: 23529.0
•Total amount spent by female customers: 7180.5
• When we combined purchases with Employment Status, 2050 males and 563 females fell into the employed category. Males spend the most during the employed phase and tend to spend less once they are homemakers, possibly because of the added responsibilities.
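Shares such as the gender split quoted above can be computed directly rather than read off a chart. A minimal sketch with made-up values (the `toy` series is an assumption, not the survey data):

```python
import pandas as pd

# Made-up gender column with a 3:1 split
toy = pd.Series(["Male", "Male", "Male", "Female"], name="Gender")

# normalize=True returns fractions instead of counts; scale to percentages
shares = toy.value_counts(normalize=True).mul(100).round(1)
print(shares)
```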
Recommendations¶
- Men spend more money on coffee than women. The company should focus on promotions and offers targeted at female customers to attract more female customers and increase their spending.
- Customers in the age group of 25-34 spend more money than other age groups. The company should focus on acquiring customers from other age groups to broaden its customer base.
- Customers mostly prefer buying coffee from Specialty Coffee Shops, with local shops being the least popular choice.
- Customers are mostly well educated, typically holding a Bachelor's or Master's degree, and the majority are from White ethnic backgrounds.
- Over 21% of customers have spent more than $1000 on coffee equipment in the past 5 years. Most customers prefer whole milk and granulated sugar in their coffee.